How to use BLCR
Administration
https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
download source and manual
after a reboot, load the BLCR kernel modules
# service blcr start
check that the blcr modules are loaded
# lsmod | grep blcr
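The check above can be wrapped in a small helper. This is only a sketch: the function name blcr_modules_loaded is made up for illustration, and it assumes the BLCR module names start with "blcr". It reads lsmod-style output on stdin, so it can be tried without root.

```shell
# Hypothetical helper: succeeds if lsmod-style input on stdin lists
# a module whose name starts with "blcr" (an assumption about naming).
blcr_modules_loaded() {
    awk '$1 ~ /^blcr/ { found = 1 } END { exit !found }'
}

# On a real system:
#   lsmod | blcr_modules_loaded || echo "BLCR not loaded, run: service blcr start" >&2
```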
Make application checkpointable
Option 1:
$ cr_run <myexecutable>
Option 2:
add the switch -lcr when linking your executable
$ ifort -o ... ... ... -lcr
Make checkpoint
take a checkpoint and let the process keep running, in case something goes wrong later
$ cr_checkpoint <pid>
take a checkpoint and terminate the process (--term sends SIGTERM, --kill sends SIGKILL)
$ cr_checkpoint --term <pid>
$ cr_checkpoint --kill <pid>
if you want to restart the process on another machine, add the switch "--save-all"; this also saves the executable and libraries, which results in a bigger file
$ cr_checkpoint --save-all --term <pid>
Process state is saved to file "context.<pid>"
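For long jobs, the plain checkpoint command can be repeated at a fixed interval. The following is a rough sketch, not a tested recipe: periodic_checkpoint is a made-up name, it uses only the "cr_checkpoint <pid>" form shown above, and it does not manage older context files.

```shell
# Hypothetical helper: checkpoint PID every INTERVAL seconds while it is alive.
periodic_checkpoint() {
    pid=$1
    interval=$2
    # kill -0 sends no signal; it only tests whether the process exists
    while kill -0 "$pid" 2>/dev/null; do
        cr_checkpoint "$pid" || return 1   # stop if a checkpoint fails
        sleep "$interval"
    done
}

# Example (myexecutable is a placeholder; assumes $! is the PID of the
# application itself when it is started through cr_run):
#   cr_run ./myexecutable &
#   periodic_checkpoint $! 600
```

A return status of 1 may simply mean that the process ended between the liveness test and the checkpoint.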
Restart process
$ cr_restart --no-restore-pid <context file>
For information on how open files are treated, see Section 4.4 "Restarting the Process" in
https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Users_Guide.html
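A restart can be guarded by a quick check that the context file is actually there. This is a minimal sketch; restart_from_context is a hypothetical name, and the only BLCR call it makes is the cr_restart command shown above.

```shell
# Hypothetical helper: restart from a context file after a basic sanity check.
restart_from_context() {
    ctx=$1
    if [ ! -r "$ctx" ]; then
        echo "context file not readable: $ctx" >&2
        return 2
    fi
    # same option as in the command above
    cr_restart --no-restore-pid "$ctx"
}

# Example:
#   restart_from_context context.12345
```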
Migration between machines
To migrate a process between machines, both machines must run the same kernel version.
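Before copying a context file to another machine, the kernel versions can be compared. This sketch assumes that the release string printed by "uname -r" is a good enough proxy for "same kernel version"; same_kernel and otherhost are placeholders.

```shell
# Hypothetical helper: compare two kernel release strings (as printed by "uname -r").
same_kernel() {
    if [ "$1" = "$2" ]; then
        echo "kernel versions match: $1"
    else
        echo "kernel mismatch: $1 vs $2" >&2
        return 1
    fi
}

# Example, with "otherhost" as a placeholder for the target machine:
#   same_kernel "$(uname -r)" "$(ssh otherhost uname -r)"
```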
Problems yet to solve
- checkpointing whole scripts (not single processes)
- integration with the SGE cluster to checkpoint automatically